Before we begin any script, we first need to make sure that the required packages are installed in our copy of R (which RStudio runs for us). We can then load those packages so they can be used in the script. The code block below does both for you.
# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')
library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)
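The same install check can also be written once for a whole list of packages. The sketch below is one possible alternative, not part of the course script. It uses base R's requireNamespace() to find out which packages are available, and the variable names needed, is_installed and not_installed are made up for illustration.

```r
# Packages used in this tutorial
needed <- c("here", "tidyverse", "ggplot2")

# requireNamespace() returns TRUE if a package can be loaded and FALSE if not
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

if (length(not_installed) > 0) {
  message("Packages still to install: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)  # uncomment to install them
} else {
  message("All required packages are installed.")
}
```

The install line is left as a comment so that nothing is downloaded unless you choose to run it.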
You should be able to see that we have installed and loaded 3
different packages. Let’s first go over the basics of what a
package is. In its simplest terms, a package is a
toolbox that someone has created for us in R that makes
our life easier. These packages build on the basic code that
comes with the R programming language (what RStudio
uses to run), called base R.
Figure 1: Opening an R package
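To make the toolbox idea concrete, the sketch below does the same job twice: once with base R and once with filter() from dplyr, one of the tidyverse packages. It uses mtcars, a small dataset built into R, rather than the course data, and the object names are illustrative.

```r
library(dplyr)

# Base R: keep the rows of mtcars where the car has 4 cylinders
four_cyl_base <- mtcars[mtcars$cyl == 4, ]

# dplyr: the same rows, written with filter()
four_cyl_dplyr <- filter(mtcars, cyl == 4)

# Both approaches keep the same number of rows
nrow(four_cyl_base)
nrow(four_cyl_dplyr)
```

Neither version is more correct than the other. The package version is simply easier to read once several steps are chained together.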
It is always a good idea to check the documentation
for a package before you use it. We can do this by using the
help syntax, which is the ?. The package
we are trying to get help with is called here. Try to run
this code by clicking on the green arrow on the corner of the
code block on the left side of your screen. This will open a
webpage that tells us the purpose of the
here package and how it works.
Figure 2: Running code in R
?here #? loads the documentation for a specified package.
Fill in the code block below by putting in the
help syntax ? followed by the name of each package
you are interested in. This will bring up the documentation for the
other packages we loaded above. Have a read
of each of these pages and click on any links you find interesting.
These are the main packages we will be using throughout
this course.
# Try to use the help function '?' to read more about the packages we are using today
# The packages we are using are 'tidyverse' and 'ggplot2'.
?tidyverse
?ggplot2
The dataset we are using has already been downloaded in the folder containing this R Markdown file. On your computer navigate to this folder and have a look at what it contains.
You should note that it contains the following:
These are the key ingredients needed to organise all projects in R.
Figure 3: Project Organisation
You will notice that the data for today, called
PSYC2001_social-media-data.csv, is a csv
file (short for comma-separated values). This
means that we will need to import the dataset using a
function capable of importing csv files.
We will be using two different functions to achieve
this. The read.csv() function is used to import our csv
dataset and it comes from the utils package which is part
of base R. But the read.csv() function needs
to know where the file is coming from. To do this, we use the
here() function from the here package. This
function tells R the location of the project we are
working from, to make locating the data easier.
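Let’s first confirm that here() knows where our project is
located on this PC. This is the top-level folder of the project, which
is usually also the ‘Working Directory’ when the project is open in RStudio.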
here()
## [1] "G:/Current/Student folders/Bart Cool/Work/PSYC2001 in R/Tutorial 2 - Data wrangling and visualization"
We can use this to easily find where our file is located and read it.
social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in csv files
Our data should now be imported into R!
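If read.csv() reports that it cannot open the file, it can help to look at the path that here() has built. A minimal sketch, assuming the same Data folder and file name as above (data_path is just an illustrative variable name):

```r
library(here)

# Build the full path to the data file, starting from the project folder
data_path <- here("Data", "PSYC2001_social-media-data.csv")
data_path

# Only try to read the file if it is really there
if (file.exists(data_path)) {
  social_media <- read.csv(file = data_path)
} else {
  message("File not found. Check the folder and file name in: ", data_path)
}
```

Printing data_path shows exactly where R is looking, which makes a misspelled folder or file name easier to spot.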
The first thing we should do whenever we import data is to see how it looks in RStudio. There are a couple of ways to do this.
Figure 4: Navigating to dataset
# Method 1 - Type in the name of the object
social_media
# Method 2 - Use the View function
View(social_media) #view automatically displays the dataset in a tab.
# Method 3 - Use the head function
head(social_media) #head displays the first 6 rows of each variable.
## id age time_on_social urban good_mood_likes bad_mood_likes followers
## 1 S1 15.2 3.06 1 22.8 46.5 173.3
## 2 S2 16.0 2.18 1 46.0 48.3 144.3
## 3 S3 16.8 1.92 1 50.8 46.1 76.5
## 4 S4 15.6 2.61 1 29.9 29.2 171.7
## 5 S5 17.1 3.24 1 37.1 52.4 109.5
## 6 S6 15.7 2.44 1 26.9 20.2 157.5
## polit_informed polit_campaign polit_activism
## 1 2.3 3.2 3.6
## 2 1.6 2.2 2.6
## 3 1.9 2.7 3.0
## 4 1.6 2.3 2.6
## 5 2.0 2.9 3.3
## 6 2.4 3.4 3.9
# Method 4 - Use the str function
str(social_media) #displays an overall summary of the object and variable structure.
## 'data.frame': 60 obs. of 10 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ age : num 15.2 16 16.8 15.6 17.1 15.7 19.7 18.6 19.6 15.5 ...
## $ time_on_social : num 3.06 2.18 1.92 2.61 3.24 2.44 1.46 1.52 1.92 2.1 ...
## $ urban : int 1 1 1 1 1 1 1 1 1 1 ...
## $ good_mood_likes: num 22.8 46 50.8 29.9 37.1 26.9 14.8 26 6.5 45.7 ...
## $ bad_mood_likes : num 46.5 48.3 46.1 29.2 52.4 20.2 35.1 35.8 12.2 32.8 ...
## $ followers : num 173.3 144.3 76.5 171.7 109.5 ...
## $ polit_informed : num 2.3 1.6 1.9 1.6 2 2.4 1.7 1.6 1.5 2.2 ...
## $ polit_campaign : num 3.2 2.2 2.7 2.3 2.9 3.4 2.4 2.2 2.1 3.1 ...
## $ polit_activism : num 3.6 2.6 3 2.6 3.3 3.9 2.7 2.6 2.4 3.5 ...
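You should now have a good idea of what
PSYC2001_social-media-data.csv looks like in RStudio.
You will also notice that the last function, str(),
displays a summary of the object. This includes the number of
observations and variables, the name of each variable, and its data type:
chr (character) for id, int (integer) for urban, and num (numeric) for all other
variables.
Figure 5: You thinking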
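Base R also has smaller helpers that each report one part of this summary on its own. The sketch below uses a toy dataframe (toy_data, built from the first three rows and columns shown by head() above) so that it runs without the csv file.

```r
# Toy dataframe: the first three rows and columns from the head() output
toy_data <- data.frame(
  id = c("S1", "S2", "S3"),
  age = c(15.2, 16.0, 16.8),
  time_on_social = c(3.06, 2.18, 1.92)
)

dim(toy_data)            # number of rows, then number of columns
names(toy_data)          # the variable names
sapply(toy_data, class)  # the data type of each variable
```

The same three calls can be run on social_media once it has been imported.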
Once we have imported our dataset into R, it’s important to check the
quality and structure of the data to ensure everything looks as
expected. One simple way to do this is by using the
summary() function.
summary(social_media) #summary provides a quick overview of the data in each variable.
## id age time_on_social urban
## Length:60 Min. :13.90 Min. :-999.000 Min. :1.0
## Class :character 1st Qu.:15.70 1st Qu.: 1.920 1st Qu.:1.0
## Mode :character Median :16.50 Median : 2.365 Median :1.5
## Mean :16.87 Mean : -30.845 Mean :1.5
## 3rd Qu.:17.43 3rd Qu.: 3.042 3rd Qu.:2.0
## Max. :23.00 Max. : 4.320 Max. :2.0
## good_mood_likes bad_mood_likes followers polit_informed
## Min. : 6.50 Min. :12.20 Min. : 61.40 Min. :0.600
## 1st Qu.:31.60 1st Qu.:39.08 1st Qu.: 76.47 1st Qu.:1.500
## Median :45.90 Median :49.30 Median :116.30 Median :1.800
## Mean :43.04 Mean :49.84 Mean :124.76 Mean :1.858
## 3rd Qu.:53.40 3rd Qu.:58.75 3rd Qu.:153.75 3rd Qu.:2.200
## Max. :89.20 Max. :91.20 Max. :336.50 Max. :3.400
## polit_campaign polit_activism
## Min. :0.800 Min. :0.900
## 1st Qu.:2.100 1st Qu.:2.400
## Median :2.550 Median :2.900
## Mean :2.602 Mean :2.977
## 3rd Qu.:3.100 3rd Qu.:3.500
## Max. :4.800 Max. :5.500
Have a close look at the summary of the time_on_social variable.
It should now be clear that this data is unusual because it has a
minimum value of -999 in the
time_on_social variable, which is measured in hours (we can’t have
negative time!).
Figure 6: Back to the future!
A good question to ask now is - why are these values in the dataset?
Sometimes when collecting data, we can’t get a response from every
participant. Instead of leaving a blank, researchers will sometimes put
in a placeholder value like -999 to show that the data is
missing. These aren’t real numbers; they just mean the data wasn’t
recorded. But -999 isn’t the standard way to show missing
data in R. R uses NA to represent missing values, and
that’s important because most R functions know how to handle
NA properly, but they will treat -999 as if it were a
real measurement.
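To see why this matters, compare the mean of a few values with and without the placeholder. This is a minimal sketch: hours is a made-up vector that combines three values from the head() output with one -999 placeholder.

```r
# Three recorded values plus one placeholder for a missing response
hours <- c(3.06, 2.18, 1.92, -999)

mean(hours)  # the placeholder is treated as a real number and drags the mean down

# Replace the placeholder with NA, R's own marker for missing data
hours_na <- replace(hours, hours == -999, NA)

mean(hours_na)                # returns NA because a value is missing
mean(hours_na, na.rm = TRUE)  # mean of the three recorded values only
```

Once the placeholder is stored as NA, R either flags the problem by returning NA or, when asked with na.rm = TRUE, leaves the missing value out of the calculation.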
Let’s first have a look at how many -999 values are
present in the data. We can do this by using the filter()
function from the dplyr package (loaded as part of the tidyverse), which is used to keep
(or remove) rows based on certain conditions. We can then use the
count() function, also from dplyr, to
count the number of rows in the dataframe.
social_media_filtered <- filter(social_media, time_on_social == -999) #keep all rows where `time_on_social` is equal to -999
count(social_media_filtered) #count the total number of rows remaining in the dataframe and print it to the console.
## n
## 1 2
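The placeholder could in principle appear in more than one variable, so it can be worth checking every column at once. One possible way to do this in base R is sketched below. Here toy_data is a made-up dataframe and the helper name count_placeholder is hypothetical.

```r
# Made-up data with one -999 placeholder in time_on_social
toy_data <- data.frame(
  age = c(15.2, 16.0, 16.8),
  time_on_social = c(3.06, -999, 1.92)
)

# Count how many values in a column are equal to -999
# (na.rm = TRUE so that columns that already contain NA are still counted)
count_placeholder <- function(column) sum(column == -999, na.rm = TRUE)

# Apply the helper to every column of the dataframe
placeholder_counts <- sapply(toy_data, count_placeholder)
placeholder_counts
```

A column with a count above zero is one that needs cleaning before analysis.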
A short aside to introduce a very special operator called a
‘pipe’, or %>%. This operator allows you to
pass the result from one function to the next seamlessly, in a sort of
assembly-line fashion. Throughout the rest of the course we
will be using ‘piping’ as it is easier to follow and code. For
instance, let’s repeat what we just did above, but with pipes instead.
social_media %>% #pass the values from social_media to the filter function
  filter(time_on_social == -999) %>% #keep all rows where time_on_social is equal to -999 and pass the result to count
  count() #count the number of remaining rows
## n
## 1 2
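The pipe works with any function, not just filter() and count(). The sketch below shows the same calculation written as nested calls and as a pipeline, using a made-up vector of numbers.

```r
library(dplyr)  # loading dplyr (or the tidyverse) makes %>% available

numbers <- c(4, 9, 16)

# Nested calls are read from the inside out
nested_result <- sum(sqrt(numbers))

# The pipe is read from top to bottom: take numbers, then sqrt, then sum
piped_result <- numbers %>%
  sqrt() %>%
  sum()

nested_result
piped_result
```

Both versions give the same answer. The piped version lists the steps in the order they happen.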
Now let’s use piping to clean this data up by replacing the
-999 values with more R-readable NA
values.
We can do this using the mutate() and na_if()
functions from the dplyr package (part of the tidyverse). The
mutate() function is used to create or modify columns in a dataframe,
and na_if() is used to replace
a given value with NA.
social_media_NA <- social_media %>%
  mutate(time_on_social = na_if(time_on_social, -999)) #mutate creates or modifies columns.
#na_if replaces -999 with NA.
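It is worth checking that the replacement worked. A minimal sketch on made-up data (toy_data and toy_clean are illustrative names); the same two checks could be run on social_media_NA.

```r
library(dplyr)

# Made-up data with one -999 placeholder
toy_data <- data.frame(
  id = c("S1", "S2", "S3"),
  time_on_social = c(3.06, -999, 1.92)
)

toy_clean <- toy_data %>%
  mutate(time_on_social = na_if(time_on_social, -999))

sum(toy_clean$time_on_social == -999, na.rm = TRUE)  # placeholders left over
sum(is.na(toy_clean$time_on_social))                 # values now marked as NA
```

If the cleaning worked, the first number is zero and the second matches the number of placeholders counted earlier.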
Now let’s look at some data! We’re going to start by visualising the
time_on_social variable. Visualising helps us understand
more about the distribution of the data, which helps us understand what
kinds of analysis we can perform.
To do this we will need to use the ggplot() function.
This is the main function from the ggplot2 package
(you should know what this is from reading the
documentation). ggplot() provides the canvas of
the graph you want to make.
To make the basic canvas ggplot() requires two
things:
The data that you want it to plot.
The variables to go on the x and y axes.
Importantly, ggplot() only provides the canvas. It does
not draw anything by itself. You have to add layers to the canvas
created by ggplot() by using other functions that can
create bars, points or lines!
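The canvas-plus-layers idea can be sketched in two steps. The example uses mtcars, a dataset built into R, and geom_point() as the layer. The object names canvas and scatter are illustrative.

```r
library(ggplot2)

# Step 1: the canvas only. Axes are set up, but nothing is drawn yet.
canvas <- ggplot(mtcars, aes(x = wt, y = mpg))

# Step 2: add a layer of points on top of the canvas.
scatter <- canvas + geom_point()

canvas   # prints an empty plot with labelled axes
scatter  # prints the same axes with one point per car
```

Swapping geom_point() for a different geom changes what is drawn without changing the canvas.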
Here we use geom_boxplot() which creates a boxplot for
us.
social_media_NA %>%
  ggplot(aes(y = time_on_social)) + #ggplot uses aesthetics (aes()) to map variables to axes.
  scale_x_discrete() + #this tells ggplot that the x-axis is categorical.
  geom_boxplot() + #creates a boxplot
  labs(y = "Time on Social Media") #short for "labels", used to label axes and titles.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
ggplot() is able to recognise and remove ‘NA’ values. Be
careful as not all R functions are able to do this.
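If you already know about the missing values and do not need the reminder, geom_boxplot() accepts na.rm = TRUE, which removes them silently. A minimal sketch on made-up data (toy_data and boxplot_quiet are illustrative names):

```r
library(ggplot2)

# Made-up data with one missing value
toy_data <- data.frame(time_on_social = c(3.06, 2.18, NA, 1.92, 2.61))

boxplot_quiet <- ggplot(toy_data, aes(y = time_on_social)) +
  geom_boxplot(na.rm = TRUE) +  # drop NA values without a warning
  labs(y = "Time on Social Media")

boxplot_quiet
```

Leaving the warning switched on is often the safer default, because it reminds you how many rows were dropped.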
ggplot() can be customised with many more functions
than we have shown here to make truly beautiful-looking
plots. We will be learning how to do this over the
next few weeks.
For now, let’s see if you can put some of the skills you have learned
so far to good use. See if you can work out how to make a histogram of
the data using the function geom_histogram().
social_media_NA %>%
  ggplot(aes(x = time_on_social)) + #ggplot uses aesthetics (aes()) to map variables to axes.
  geom_histogram() + #creates a histogram (the y-axis shows counts by default)
  labs(x = "Time on social media", y = "Count") #short for "labels", used to label axes and titles.
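By default geom_histogram() splits the data into 30 bins and shows counts on the y-axis. The sketch below shows two optional tweaks: setting binwidth yourself, and asking for density instead of counts with after_stat(density). It uses the six time_on_social values shown by head() above, and the bin width of 0.5 hours is an arbitrary choice for illustration.

```r
library(ggplot2)

# The first six time_on_social values from the head() output, in hours
toy_data <- data.frame(time_on_social = c(3.06, 2.18, 1.92, 2.61, 3.24, 2.44))

# Counts on the y-axis, with bins that are half an hour wide
hist_count <- ggplot(toy_data, aes(x = time_on_social)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Time on social media (hours)", y = "Count")

# Density on the y-axis instead of counts
hist_density <- ggplot(toy_data, aes(x = time_on_social, y = after_stat(density))) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Time on social media (hours)", y = "Density")

hist_count
hist_density
```

Whichever version you use, the y-axis label should match what is plotted.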
Well done! You have completed everything you need to for this week. If you have finished in record time, please consult with your tutor about what to do next. Otherwise, we will see you next lab!
Figure 7: Students’ reaction to this information!